Goto

Collaborating Authors

 feature engineering




A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data

Neural Information Processing Systems

Tabular data is prevalent in real-world machine learning applications, and new models for supervised learning of tabular data are frequently proposed. Comparative studies assessing performance differences typically have model-centered evaluation setups with overly standardized data preprocessing. This limits the external validity of these studies, as in real-world modeling pipelines, models are typically applied after dataset-specific preprocessing and feature engineering. We address this gap by proposing a data-centric evaluation framework. We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset. We conduct experiments with different preprocessing pipelines and hyperparameter optimization (HPO) regimes to quantify the impact of model selection, HPO, feature engineering, and test-time adaptation. Our main findings reveal: 1) After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection reduces.


EfficientECG: Cross-Attention with Feature Fusion for Efficient Electrocardiogram Classification

Deng, Hanhui, Li, Xinglin, Luo, Jie, Wu, Di

arXiv.org Artificial Intelligence

Electrocardiogram is a useful diagnostic signal that can detect cardiac abnormalities by measuring the electrical activity generated by the heart. Due to its rapid, non-invasive, and richly informative characteristics, ECG has many emerging applications. In this paper, we study novel deep learning technologies to effectively manage and analyse ECG data, with the aim of building a diagnostic model, accurately and quickly, that can substantially reduce the burden on medical workers. Unlike the existing ECG models that exhibit a high misdiagnosis rate, our deep learning approaches can automatically extract the features of ECG data through end-to-end training. Specifically, we first devise EfficientECG, an accurate and lightweight classification model for ECG analysis based on the existing EfficientNet model, which can effectively handle high-frequency long-sequence ECG data with various leading types. On top of that, we next propose a cross-attention-based feature fusion model of EfficientECG for analysing multi-lead ECG data with multiple features (e.g., gender and age). Our evaluations on representative ECG datasets validate the superiority of our model against state-of-the-art works in terms of high precision, multi-feature fusion, and lightweights.


Enhancing Dimensionality Prediction in Hybrid Metal Halides via Feature Engineering and Class-Imbalance Mitigation

Karabin, Mariia, Armstrong, Isaac, Beck, Leo, Apanel, Paulina, Eisenbach, Markus, Mitzi, David B., Terletska, Hanna, Heinz, Hendrik

arXiv.org Artificial Intelligence

We present a machine learning framework for predicting the structural dimensionality of hybrid metal halides (HMHs), including organic-inorganic perovskites, using a combination of chemically-informed feature engineering and advanced class-imbalance handling techniques. The dataset, consisting of 494 HMH structures, is highly imbalanced across dimensionality classes (0D, 1D, 2D, 3D), posing significant challenges to predictive modeling. This dataset was later augmented to 1336 via the Synthetic Minority Oversampling Technique (SMOTE) to mitigate the effects of the class imbalance. We developed interaction-based descriptors and integrated them into a multi-stage workflow that combines feature selection, model stacking, and performance optimization to improve dimensionality prediction accuracy. Our approach significantly improves F1-scores for underrepresented classes, achieving robust cross-validation performance across all dimensionalities.


ML-Tool-Bench: Tool-Augmented Planning for ML Tasks

Chittepu, Yaswanth, Addanki, Raghavendra, Mai, Tung, Rao, Anup, Kveton, Branislav

arXiv.org Artificial Intelligence

The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows represents a significant frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. While recent work on building ML agents has explored using large language models (LLMs) for direct code generation, tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus primarily on task-specific tool selection or argument extraction for tool invocation, failing to evaluate the sophisticated planning capabilities required for ML Agents. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents using a curated set of 61 specialized tools and 15 tabular ML challenges from Kaggle. Our benchmark goes beyond traditional tool-use evaluation by incorporating an in-memory named object management, allowing agents to flexibly name, save, and retrieve intermediate results throughout the workflows. We demonstrate that standard ReAct-style approaches struggle to generate valid tool sequences for complex ML pipelines, and that tree search methods with LLM-based evaluation underperform due to inconsistent state scoring. To address these limitations, we propose two simple approaches: 1) using shaped deterministic rewards with structured textual feedback, and 2) decomposing the original problem into a sequence of sub-tasks, which significantly improves trajectory validity and task performance. Using GPT-4o, our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges. We believe our work provides a foundation for developing more capable tool-augmented planning ML agents.





Dataforge: A Data Agent Platform for Autonomous Data Engineering

Wang, Xinyuan, Fu, Yanjie

arXiv.org Artificial Intelligence

B. Hierarchical Routing After data cleaning, to enable efficient and reliable decision-making, we adopt a hierarchical routing architecture, including task-level and action-level reasoning. At the task-level routing, a rule-based router quickly identifies the task type: classification, regression, or unsupervised learning, based on table schema metadata, such as, data types, label structures, and feature distribution. Such lightweight router relies on deterministic heuristics, instead of large language models, thus, enable fast and reliable responses across diverse datasets. At the action-level routing, a compact LLM-based planner refines the decision by selects and plans the most suitable feature-level actions such as, different ordered combinations of feature selection, transformation, or generation, under the identified task (e.g., a classification dataset). Since each router operates within a smaller, well-defined action space, this hierarchical routing approach not only accelerates processing but also avoid invalid or high-risk operations. C. Dual Feedback Loops We develop two collaborative feedback loops to transform the static workflow into an adaptive, self-correcting process, in order to achieve autonomy and continual refinement. 1) Action V alidation Loop for Safety: This feddback loop is to ground actions to ensure operational safety before execution. Each planned action is first grounded through schema alignment, type checking, and logical consistency tests, such as, detecting divisions by zero or invalid type conversions. Only actions that pass validation proceed to execution so as to prevent runtime errors and maintaining workflow integrity.